Abstract

Since its inception, the modus operandi of multi-task learning (MTL) has been to
minimize the task-wise mean of the empirical risks. We introduce a generalized
loss-compositional paradigm for MTL that includes a spectrum of formulations as
a subfamily. One endpoint of this spectrum is minimax MTL: a new MTL formulation
that minimizes the maximum of the tasks' empirical risks. Via a certain relaxation
of minimax MTL, we obtain a continuum of MTL formulations spanning
minimax MTL and classical MTL. The full paradigm itself is loss-compositional,
operating on the vector of empirical risks. It incorporates minimax MTL, its relaxations,
and many new MTL formulations as special cases. We show theoretically
that minimax MTL tends to avoid worst-case outcomes on newly drawn test tasks
in the learning to learn (LTL) test setting. The results of several MTL formulations
on synthetic and real problems in the MTL and LTL test settings are encouraging.

1 Introduction 

The essence of machine learning is to exploit what we observe in order to form accurate predictors
of what we cannot. A multi-task learning (MTL) algorithm learns an inductive bias to learn several
tasks together. MTL is pervasive in machine learning: it has natural connections to random
effects models [15]; user preference prediction (including collaborative filtering) can be framed
as MTL [16]; multi-class classification admits the popular one-vs-all and all-pairs MTL reductions;
and MTL admits provably good learning in settings where single-task learning is hopeless [3, 12].
But if we see examples from a random set of tasks today, which of these tasks will matter tomorrow?
Not knowing in the present what challenges nature has in store for the future, a sensible strategy is
to mitigate the worst case by ensuring some minimum proficiency on each task.

Consider a simple learning scenario: A music preference prediction company is in the business of
predicting the ratings (on a 5-star scale) that different users would assign to songs. At training
time, the company learns a shared representation for predicting the users' song ratings by pooling
together the company's limited data on each user's preferences. Given this learned representation,
a separate predictor for each user can be trained very quickly. At test time, the environment draws
a user according to some (possibly randomized) rule and solicits from the company a prediction of
that user's preference for a particular song. The environment may also ask for predictions about new
users, described by a few ratings each, and so the company must leverage its existing representation
to rapidly learn new predictors and produce ratings for these new users.

Classically, multi-task learning has sought to minimize the (regularized) sum of the empirical risks 
over a set of tasks. In this way, classical MTL implicitly assumes that once the learner has been 
trained, it will be tested on test tasks drawn uniformly at random from the empirical task distribution 
of the training tasks. Notably, there are several reasons why classical MTL may not be ideal: 

'Work completed while at Georgia Institute of Technology 
• While at training time the usual flavor of MTL commits to a fixed distribution over users
(typically either uniform or proportional to the number of ratings available for each user), at test
time there is no guarantee what user distribution we will encounter. In fact, there may not exist any
fixed user distribution: the sequence of users for which ratings are elicited could be adversarial.

• Even in the case when the distribution over tasks is not adversarial, it may be in the interest of
the music preference prediction company to guarantee some minimum level of accuracy per user
in order to minimize negative feedback and a potential loss of business, rather than maximizing
the mean level of accuracy over all users.

• Whereas minimizing the average prediction error is very much a teleological endeavor, typically 
at the expense of some locally egregious outcomes, minimizing the worst-case prediction error 
respects a notion of fairness to all tasks (or people). 

This work introduces minimax multi-task learning as a response to the above scenario.¹ In addition,
we cast a spectrum of multi-task learning formulations. At one end of the spectrum lies minimax MTL,
and departing from this point progressively relaxes the "hardness" of the maximum until full
relaxation reaches the second endpoint and recovers classical MTL. We further sculpt a generalized
loss-compositional paradigm for MTL which includes this spectrum and several other new MTL
formulations. This paradigm equally applies to the problem of learning to learn (LTL), in which the
goal is to learn a hypothesis space from a set of training tasks such that this representation admits
good hypotheses on future tasks. In truth, MTL and LTL typically are handled equivalently at training
time (this work will be no exception), and they diverge only in their test settings and hence
the learning theoretic inquiries they inspire.

Contributions. The first contribution of this work is to introduce minimax MTL and a continuum 
of relaxations. Second, we introduce a generalized loss-compositional paradigm for MTL which 
admits a number of new MTL formulations and also includes classical MTL as a special case. 
Third, we empirically evaluate the performance of several MTL formulations from this paradigm 
in the multi-task learning and learning to learn settings, under the task-wise maximum test risk and 
task-wise mean test risk criteria, on four datasets (one synthetic, three real). Finally, Theorem 1
is the core theoretical contribution of this work and shows the following: if it is possible to obtain
maximum empirical risk across a set of training tasks below some level $\gamma$, then it is likely that the
maximum true risk obtained by the learner on a new task is bounded by roughly $\gamma$. Hence, if the goal
is to minimize the worst-case outcome over new tasks, the theory suggests minimizing the maximum
of the empirical risks across the training tasks rather than their mean.

In the next section, we recall the settings of multi-task learning and learning to learn, formally
introduce minimax MTL, and motivate it theoretically. In Section 3 we introduce a continuously
parametrized family of minimax MTL relaxations and the new generalized loss-compositional
paradigm. Section 4 presents an empirical evaluation of various MTL/LTL formulations with different
models on four datasets. Finally, we close with a discussion.

2 Minimax multi-task learning 

We begin with a promenade through the basic MTL and LTL setups, with an effort to abide by the
notation introduced by Baxter [4]. Throughout the rest of the paper, each labeled example $(x, y)$
will live in $\mathcal{X} \times \mathcal{Y}$ for input instance $x$ and label $y$. Typical choices of $\mathcal{X}$ include
$\mathbb{R}^n$ or a compact subset thereof, while $\mathcal{Y}$ typically is a compact subset of $\mathbb{R}$ or the binary
$\{-1, 1\}$. In addition, define a loss function $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_+$. For simplicity, this work
considers $\ell_2$ loss (squared loss) $\ell(y', y) = (y' - y)^2$ for regression and hinge loss
$\ell(y', y) = \max\{0, 1 - y'y\}$ for classification.
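As a concrete reading of the two losses just defined, the following is a minimal Python sketch. The function names `squared_loss` and `hinge_loss` are illustrative choices made here and do not come from the paper.

```python
def squared_loss(y_pred: float, y: float) -> float:
    """Squared loss (y' - y)^2, used for regression."""
    return (y_pred - y) ** 2


def hinge_loss(y_pred: float, y: int) -> float:
    """Hinge loss max{0, 1 - y'y} for a binary label y in {-1, +1}."""
    return max(0.0, 1.0 - y_pred * y)
```

The hinge loss is zero once the prediction has the correct sign with margin at least one, and grows linearly otherwise.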

MTL and LTL often are framed as applying an inductive bias to learn a common hypothesis space, 
selected from a fixed family of hypothesis spaces, and thereafter learning from this hypothesis space 
a hypothesis for each task observed at training time. It will be useful to formalize the various sets 
and elements present in the preceding statement. Let $\mathbb{H}$ be a family of hypothesis spaces. Any
hypothesis space $\mathcal{H} \in \mathbb{H}$ itself is a set of hypotheses; each hypothesis $h \in \mathcal{H}$ is a map $h : \mathcal{X} \to \mathbb{R}$.



¹Note that minimax MTL does not refer to the minimax estimators of statistical decision theory.
Learning to learn. In learning to learn, the goal is to achieve inductive transfer to learn the best
$\mathcal{H}$ from $\mathbb{H}$. Unlike in MTL, there is a notion of an environment of tasks: an unknown probability
measure $Q$ over a space of task probability measures $\mathcal{P}$. The goal is to find the optimal representation
via the objective

$$\inf_{\mathcal{H} \in \mathbb{H}} \mathbb{E}_{P \sim Q} \inf_{h \in \mathcal{H}} \mathbb{E}_{(x,y) \sim P}\, \ell(y, h(x)). \tag{1}$$

In practice, $T$ (unobservable) training task probability measures $P_1, \ldots, P_T \in \mathcal{P}$ are drawn iid from
$Q$, and from each task $t$ a set of $m$ examples is drawn iid from $P_t$.



Multi-task learning. Whereas in learning to learn there is a distribution over tasks, in multi-task
learning there is a fixed, finite set of tasks indexed by $[T] := \{1, \ldots, T\}$. Each task $t \in [T]$
is coupled with a fixed but unknown probability measure $P_t$. Classically, the goal of MTL is to
minimize the expected loss at test time under the uniform distribution on $[T]$:

$$\inf_{\mathcal{H} \in \mathbb{H}} \frac{1}{T} \sum_{t \in [T]} \inf_{h \in \mathcal{H}} \mathbb{E}_{(x,y) \sim P_t}\, \ell(y, h(x)). \tag{2}$$

Notably, this objective is equivalent to (1) when $Q$ is the uniform distribution on $\{P_1, \ldots, P_T\}$. In
terms of the data generation model, MTL differs from LTL since the tasks are fixed; however, just
as in LTL, from each task $t$ a set of $m$ examples is drawn iid from $P_t$.



2.1 Minimax MTL 



A natural generalization of classical MTL results by introducing a prior distribution $\pi$ over the index
set of tasks $[T]$. Given $\pi$, the (idealized) objective of this generalized MTL is

$$\inf_{\mathcal{H} \in \mathbb{H}} \mathbb{E}_{t \sim \pi} \inf_{h \in \mathcal{H}} \mathbb{E}_{(x,y) \sim P_t}\, \ell(y, h(x)), \tag{3}$$

given only the training data $\{(x_{t,1}, y_{t,1}), \ldots, (x_{t,m}, y_{t,m})\}_{t \in [T]}$. The classical MTL objective (2)
equals (3) when $\pi$ is taken to be the uniform prior over $[T]$. We argue that in many instances, that
which is most relevant to minimize is not the expected error under a uniform distribution over tasks,
or even any pre-specified $\pi$, but rather the expected error for the worst $\pi$. We propose to minimize
the maximum error over tasks under an adversarial choice of $\pi$, yielding the objective:

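$$\inf_{\mathcal{H} \in \mathbb{H}} \sup_{\pi} \mathbb{E}_{t \sim \pi} \inf_{h \in \mathcal{H}} \mathbb{E}_{(x,y) \sim P_t}\, \ell(y, h(x)),$$

where the supremum is taken over the $T$-dimensional simplex. As the supremum (assuming it is
attained) is attained at an extreme point of the simplex, this objective is equivalent to

$$\inf_{\mathcal{H} \in \mathbb{H}} \max_{t \in [T]} \inf_{h \in \mathcal{H}} \mathbb{E}_{(x,y) \sim P_t}\, \ell(y, h(x)). \tag{4}$$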
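The reduction from a supremum over priors to a maximum over tasks can be checked in one line. Writing $r_t$ for the inner risk of task $t$ under a fixed $\mathcal{H}$, the objective is linear in $\pi$. The symbols $r_t$, $\Delta_T$ (the simplex) and $e_{t^*}$ (a vertex of the simplex) are shorthand introduced for this sketch only, not notation from the paper:

```latex
\sup_{\pi \in \Delta_T} \sum_{t \in [T]} \pi_t r_t
  \;\le\; \Big(\max_{t \in [T]} r_t\Big) \sum_{t \in [T]} \pi_t
  \;=\; \max_{t \in [T]} r_t ,
\qquad \text{with equality at } \pi = e_{t^*},\quad t^* \in \arg\max_{t \in [T]} r_t .
```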
In practice, we approximate the true objective by using the (regularized) empirical objective:

$$\inf_{\mathcal{H} \in \mathbb{H}} \max_{t \in [T]} \inf_{h \in \mathcal{H}} \sum_{i=1}^{m} \ell(y_{t,i}, h(x_{t,i})).$$
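To illustrate how the classical and minimax criteria can disagree, here is a small Python sketch for a finite family of representations. It assumes that the best achievable empirical risk of each representation on each task has already been computed; the table `risks`, the names `H_a` and `H_b`, and the function `select_representation` are hypothetical and serve only as an illustration.

```python
from typing import Dict, List


def select_representation(risks: Dict[str, List[float]], criterion: str) -> str:
    """Pick the representation whose per-task risks have the smallest mean or max."""
    if criterion == "mean":
        def score(r: List[float]) -> float:
            return sum(r) / len(r)
    elif criterion == "max":
        def score(r: List[float]) -> float:
            return max(r)
    else:
        raise ValueError("criterion must be 'mean' or 'max'")
    return min(risks, key=lambda name: score(risks[name]))


# Hypothetical best-in-space empirical risks for T = 4 tasks.
risks = {
    "H_a": [0.10, 0.10, 0.10, 0.90],  # low on average, one badly served task
    "H_b": [0.35, 0.35, 0.35, 0.40],  # higher on average, no badly served task
}
```

With these made-up numbers the mean criterion prefers `H_a`, while the max criterion prefers `H_b`, which is the behavior minimax MTL is designed to favor.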



In the next section, we motivate minimax MTL theoretically by showing that the worst-case performance
on future tasks likely will not be much higher than the maximum of the empirical risks for
the training tasks. In this short paper, we restrict attention to the case of finite $\mathbb{H}$.



2.2 A learning to learn bound for the maximum risk 

In this subsection, we use the following notation. Let $P^{(1)}, \ldots, P^{(T)}$ be probability measures drawn
iid from $Q$, and for $t \in [T]$ let $z^{(t)}$ be an $m$-sample (a sample of $m$ points) from $P^{(t)}$ with
corresponding empirical measure $P_m^{(t)}$. Also, if $P$ is a probability measure then $P \ell \circ h := \mathbb{E}\, \ell(y, h(x))$;
similarly, if $P_m$ is an empirical measure, then $P_m \ell \circ h := \frac{1}{m} \sum_{i=1}^{m} \ell(y_i, h(x_i))$.

Our focus is the learning to learn setting with a minimax lens: when one learns a representation
$\mathcal{H} \in \mathbb{H}$ from multiple training tasks and observes maximum empirical risk $\gamma$, we would like to
guarantee that $\mathcal{H}$'s true risk on a newly drawn test task will be bounded by roughly $\gamma$. Such a goal is
in striking contrast to the classical emphasis of learning to learn, where the goal is to obtain bounds
on $\mathcal{H}$'s expected true risk. Using $\mathcal{H}$'s expected true risk and Markov's inequality, Baxter [4, the
display prior to (25)] showed that the probability that $\mathcal{H}$'s true risk on a newly drawn test task is
above some level $\gamma$ decays as the expected true risk over $\gamma$:

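$$\Pr\left\{ \inf_{h \in \mathcal{H}} P \ell \circ h > \gamma \right\} \le \frac{\frac{1}{T} \sum_{t \in [T]} \min_{h \in \mathcal{H}} P_m^{(t)} \ell \circ h + \varepsilon}{\gamma}, \tag{5}$$

where the size of $\varepsilon$ is controlled by $T$, $m$, and the complexities of certain spaces.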

The expected true risk is not of primary interest for controlling the tail of the (random) true risk,
and a more direct approach yields a much better bound. In this short paper we restrict the space of
representations $\mathbb{H}$ to be finite with cardinality $C$; in this case, the analysis is particularly simple and
illuminates the idea for proving the general case. The next theorem is the main result of this section:

Theorem 1. Let $|\mathbb{H}| = C$, and let the loss $\ell$ be $L$-Lipschitz in its second argument and bounded by
$B$. Suppose $T$ tasks $P^{(1)}, \ldots, P^{(T)}$ are drawn iid from $Q$ and from each task $P^{(t)}$ an iid $m$-sample
$z^{(t)}$ is drawn. Suppose there exists $\mathcal{H} \in \mathbb{H}$ such that all $t \in [T]$ satisfy $\min_{h \in \mathcal{H}} P_m^{(t)} \ell \circ h \le \gamma$. Let
$P$ be a newly drawn probability measure from $Q$. Let $\hat{h}$ be the empirical risk minimizer over the test
$m$-sample. With probability at least $1 - \delta$ with respect to the random draw of the $T$ tasks and their
$T$ corresponding $m$-samples:

$$\Pr\left\{ P \ell \circ \hat{h} > \gamma + \frac{1}{T} + 2L \max_{\mathcal{H} \in \mathbb{H}} \mathcal{R}_m(\mathcal{H}) + \sqrt{\frac{8 \log(4/\delta)}{m}} \right\} \le \frac{\log\frac{2C}{\delta} + \log\lceil B \rceil + \log(T + 1)}{T}. \tag{6}$$

In the above, $\mathcal{R}_m(\mathcal{H})$ is the Rademacher complexity of $\mathcal{H}$ (cf. [5]). Critically, in (6) the probability
of observing a task with high true risk decays with $T$, whereas in (5) the decay is independent of $T$.
Hence, when the goal is to minimize the probability of bad performance on future tasks uniformly,
this theorem motivates minimizing the maximum of the empirical risks as opposed to their mean.
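To get a feel for the $1/T$ decay, the following Python sketch evaluates the right-hand side of (6), read here as $(\log(2C/\delta) + \log\lceil B \rceil + \log(T+1))/T$. The function name `minimax_ltl_tail_bound` and the example parameter values are illustrative assumptions, not quantities taken from the paper.

```python
import math


def minimax_ltl_tail_bound(T: int, C: int, B: float, delta: float) -> float:
    """Evaluate (log(2C/delta) + log(ceil(B)) + log(T + 1)) / T."""
    if T < 1 or C < 1 or B <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need T >= 1, C >= 1, B > 0 and 0 < delta < 1")
    numerator = math.log(2 * C / delta) + math.log(math.ceil(B)) + math.log(T + 1)
    return numerator / T


# Illustrative values only: C = 10 candidate representations, loss bounded by B = 1.
for T in (10, 100, 1000):
    print(T, minimax_ltl_tail_bound(T, C=10, B=1.0, delta=0.05))
```

For small $T$ the value can exceed one, in which case the bound is vacuous; it becomes informative only once $T$ dominates the logarithmic terms.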

For the proof of Theorem 1, first consider the singleton case $\mathbb{H} = \{\mathcal{H}_1\}$. Suppose that for $\gamma$ fixed a
priori, the maximum of the empirical risks is bounded by $\gamma$, i.e. $\max_{t \in [T]} \min_{h \in \mathcal{H}_1} P_m^{(t)} \ell \circ h \le \gamma$.

Let a new probability measure $P$ drawn from $Q$ correspond to a new test task. Suppose the probability
of the event $[\min_{h \in \mathcal{H}_1} P_m \ell \circ h > \gamma]$ is at least $\varepsilon$. Then the probability that $\gamma$ bounds all $T$
empirical risks is at most $(1 - \varepsilon)^T \le e^{-T\varepsilon}$. Hence, with probability at least $1 - e^{-T\varepsilon}$:

$$\Pr\left\{ \min_{h \in \mathcal{H}_1} P_m \ell \circ h > \gamma \right\} \le \varepsilon. \tag{7}$$
A simple application of the union bound extends this result for finite $\mathbb{H}$:

Lemma 1. Under the same conditions as Theorem 1, with probability at least $1 - \delta/2$ with respect
to the random draw of the $T$ tasks and their $T$ corresponding $m$-samples:

$$\Pr\left\{ \min_{h \in \mathcal{H}} P_m \ell \circ h > \gamma \right\} \le \frac{\log\frac{2C}{\delta}}{T}. \tag{8}$$
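The rate in Lemma 1 can be traced back to the singleton argument. As a sketch of the bookkeeping, assuming the union bound is taken over the $C$ representations in $\mathbb{H}$ and the total failure probability is set to $\delta/2$:

```latex
\Pr\big[\text{(7) fails for some } \mathcal{H} \in \mathbb{H}\big]
  \;\le\; \sum_{\mathcal{H} \in \mathbb{H}} e^{-T\varepsilon}
  \;=\; C\, e^{-T\varepsilon}
  \;=\; \frac{\delta}{2}
\quad\Longleftrightarrow\quad
\varepsilon \;=\; \frac{1}{T} \log\frac{2C}{\delta} .
```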

The bound in the lemma states a $1/T$ rate of decay for the probability that the empirical risk obtained
by $\mathcal{H}$ on a new task exceeds $\gamma$. Next, we relate this empirical risk with the true risk obtained by the
empirical risk minimizer. Note that at test time $\mathcal{H}$ is fixed and hence independent of any test
$m$-sample. Then, from by now standard learning theory results of Bartlett and Mendelson [5]:

Lemma 2. Take loss $\ell$ as in Theorem 1. With probability at least $1 - \delta/2$, for all $h \in \mathcal{H}$ uniformly:

$$P \ell \circ h \le P_m \ell \circ h + 2L \mathcal{R}_m(\mathcal{H}) + \sqrt{(8 \log(4/\delta))/m}. \tag{9}$$



In particular, with high probability the true risk of the empirical risk minimizer is not much larger
than its empirical risk. Theorem 1 now follows from Lemmas 1 and 2 and a union bound over
$\gamma \in \Gamma := \{0, 1/T, 2/T, \ldots, \lceil B \rceil\}$; note that mapping the observed maximum empirical risk $\gamma$ to
$\min\{\gamma' \in \Gamma \mid \gamma \le \gamma'\}$ picks up the additional $\frac{1}{T}$ term in (6).
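The extra logarithmic terms in (6) can be accounted for by counting the grid points. As a sketch, assuming Lemma 1 is applied with $\delta/2$ replaced by $\delta/(2|\Gamma|)$ for each grid value and using $\lceil B \rceil \ge 1$:

```latex
|\Gamma| \;=\; T \lceil B \rceil + 1 \;\le\; \lceil B \rceil (T + 1)
\quad\Longrightarrow\quad
\frac{1}{T} \log\frac{2C\,|\Gamma|}{\delta}
  \;\le\; \frac{\log\frac{2C}{\delta} + \log\lceil B \rceil + \log(T + 1)}{T} .
```

Rounding the observed $\gamma$ up to the nearest grid point changes it by at most $1/T$, which is the source of the $\frac{1}{T}$ term on the left-hand side of (6).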

In the next section, we introduce a loss-compositional paradigm for multi-task learning which
includes as special cases minimax MTL and classical MTL.

3 A generalized loss-compositional paradigm for MTL

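The paradigm can benefit from a bit of notation. Given a set of $T$ tasks, we represent the empirical
risk for hypothesis $h_t \in \mathcal{H}$ ($\in \mathbb{H}$) on task $t \in [T]$ as $\hat{\ell}_t(h_t) := \sum_{i=1}^{m} \ell(y_{t,i}, h_t(x_{t,i}))$. Additionally
define a set of hypotheses for multiple tasks $\mathbf{h} := (h_1, \ldots, h_T) \in \mathcal{H}^T$ and the vector of empirical
risks $\hat{\ell}(\mathbf{h}) := (\hat{\ell}_1(h_1), \ldots, \hat{\ell}_T(h_T))$.

With this notation set, the proposed loss-compositional paradigm encompasses any regularized
minimization of a (typically convex) function $\phi : \mathbb{R}_+^T \to \mathbb{R}_+$ of the empirical risks:

$$\inf_{\mathcal{H} \in \mathbb{H}} \inf_{\mathbf{h} \in \mathcal{H}^T} \phi\big(\hat{\ell}(\mathbf{h})\big) + \Omega\big((\mathcal{H}, \mathbf{h})\big), \tag{10}$$

where $\Omega(\cdot) : \mathbb{H} \times \bigcup_{\mathcal{H} \in \mathbb{H}} \mathcal{H}^T \to \mathbb{R}_+$ is a regularizer.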



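$\ell_p$ MTL. One notable specialization that is still quite general is the case when $\phi$ is an $\ell_p$-norm,
yielding $\ell_p$ MTL. This subfamily encompasses classical MTL and many new MTL formulations:

• Classical MTL as $\ell_1$ MTL:

$$\inf_{\mathcal{H} \in \mathbb{H}} \inf_{\mathbf{h} \in \mathcal{H}^T} \frac{1}{T} \sum_{t \in [T]} \hat{\ell}_t(h_t) + \Omega\big((\mathcal{H}, \mathbf{h})\big) \;\equiv\; \inf_{\mathcal{H} \in \mathbb{H}} \inf_{\mathbf{h} \in \mathcal{H}^T} \frac{1}{T} \|\hat{\ell}(\mathbf{h})\|_1 + \Omega\big((\mathcal{H}, \mathbf{h})\big).$$

• Minimax MTL as $\ell_\infty$ MTL:

$$\inf_{\mathcal{H} \in \mathbb{H}} \inf_{\mathbf{h} \in \mathcal{H}^T} \max_{t \in [T]} \hat{\ell}_t(h_t) + \Omega\big((\mathcal{H}, \mathbf{h})\big) \;\equiv\; \inf_{\mathcal{H} \in \mathbb{H}} \inf_{\mathbf{h} \in \mathcal{H}^T} \|\hat{\ell}(\mathbf{h})\|_\infty + \Omega\big((\mathcal{H}, \mathbf{h})\big).$$

• A new formulation, $\ell_2$ MTL:

$$\inf_{\mathcal{H} \in \mathbb{H}} \inf_{\mathbf{h} \in \mathcal{H}^T} \Big( \frac{1}{T} \sum_{t \in [T]} \big(\hat{\ell}_t(h_t)\big)^2 \Big)^{1/2} + \Omega\big((\mathcal{H}, \mathbf{h})\big) \;\equiv\; \inf_{\mathcal{H} \in \mathbb{H}} \inf_{\mathbf{h} \in \mathcal{H}^T} \frac{1}{\sqrt{T}} \|\hat{\ell}(\mathbf{h})\|_2 + \Omega\big((\mathcal{H}, \mathbf{h})\big).$$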
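The three compositions above differ only in how they aggregate the vector of empirical risks. The following Python sketch computes the aggregation term alone (without the regularizer). The function name `lp_mtl_objective` is illustrative, and the $(1/T)^{1/p}$ normalization for general finite $p$ is an assumption chosen here so that $p = 1$ and $p = 2$ match the displayed formulas.

```python
import math
from typing import Sequence


def lp_mtl_objective(risks: Sequence[float], p: float) -> float:
    """Normalized l_p composition of nonnegative empirical risks.

    p = 1 gives the task-wise mean, p = math.inf gives the maximum,
    and a finite p gives (1/T * sum_t risk_t^p)^(1/p).
    """
    T = len(risks)
    if T == 0 or p < 1:
        raise ValueError("need at least one task and p >= 1")
    if math.isinf(p):
        return max(risks)
    return (sum(r ** p for r in risks) / T) ** (1.0 / p)
```

For example, on the risk vector `[3.0, 4.0]` the $\ell_1$ composition is the mean 3.5 and the $\ell_\infty$ composition is the maximum 4.0, with the $\ell_2$ composition in between.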



A natural question is why one might consider minimizing $\ell_p$-norms of the empirical risks vector for
$1 < p < \infty$, as in $\ell_2$ MTL. The contour of the $\ell_1$-norm of the empirical risks evenly trades off
empirical risks between different tasks; however, it has been observed that overfitting often happens
near the end of learning, rather than the beginning [14]. More precisely, when the empirical risk is
high, the gradient of the empirical risk (taken with respect to the parameter $(\mathcal{H}, \mathbf{h})$) is likely to have
positive inner product with the gradient of the true risk. Therefore, given a candidate solution with a
corresponding vector of empirical risks, a sensible strategy is to take a step in solution space which
places more emphasis on tasks with higher empirical risk. This strategy is particularly appropriate
when the class of learners has high capacity relative to the amount of available data. This observation
sets the foundation for an approach that minimizes norms of the empirical risks.
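The emphasis on high-risk tasks can be made explicit through the chain rule: the gradient of $\|\hat{\ell}\|_p$ with respect to the parameters is a weighted sum of the per-task risk gradients, with the weight of task $t$ equal to $(\hat{\ell}_t / \|\hat{\ell}\|_p)^{p-1}$. The Python sketch below computes these weights; the function name `lp_task_weights` is an illustrative choice, and strictly positive risks are assumed so that the norm is differentiable.

```python
from typing import List, Sequence


def lp_task_weights(risks: Sequence[float], p: float) -> List[float]:
    """Partial derivatives of ||risks||_p with respect to each (positive) risk."""
    if p < 1 or any(r <= 0 for r in risks):
        raise ValueError("need p >= 1 and strictly positive risks")
    norm = sum(r ** p for r in risks) ** (1.0 / p)
    return [(r / norm) ** (p - 1) for r in risks]
```

With $p = 1$ every task receives the same weight, whereas for larger $p$ the weights concentrate on the tasks with the largest empirical risk, approaching the minimax behavior as $p$ grows.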

In this work, we also discuss an interesting subset of the loss-compositional paradigm which does
not fit into $\ell_p$ MTL; this subfamily embodies a continuum of relaxations of minimax MTL.

$\alpha$-minimax MTL. In some cases, minimizing the maximum loss can exhibit certain disadvantages
because the maximum loss is not robust to situations when a small fraction of the tasks are
fundamentally harder than the remaining tasks. Consider the case when the empirical risk for each
task in this small fraction cannot be reduced below a level $u$. Rather than rigidly minimizing the
maximum loss, a more robust alternative is to minimize the maximum loss in a soft way. Intuitively,
the idea is to ensure that most tasks have low empirical risk, but a small fraction of tasks are
permitted to have higher loss. We formalize this as $\alpha$-minimax MTL, via the relaxed objective:

$$\underset{\mathcal{H} \in \mathbb{H},\, \mathbf{h} \in \mathcal{H}^T}{\text{minimize}} \;\; \min_{b \ge 0} \Big\{ b + \frac{1}{\alpha} \sum_{t \in [T]} \max\{0, \hat{\ell}_t(h_t) - b\} \Big\} + \Omega\big((\mathcal{H}, \mathbf{h})\big). \tag{11}$$

In the above, $\phi$ from the loss-compositional paradigm (10) is a variational function of the empirical
risks vector. The above optimization problem is equivalent to the perhaps more intuitive problem:

$$\underset{\mathcal{H} \in \mathbb{H},\, \mathbf{h} \in \mathcal{H}^T,\, b \ge 0,\, \xi \ge 0}{\text{minimize}} \;\; b + \frac{1}{\alpha} \sum_{t \in [T]} \xi_t + \Omega\big((\mathcal{H}, \mathbf{h})\big) \quad \text{subject to} \quad \hat{\ell}_t(h_t) \le b + \xi_t, \;\; t \in [T]. \tag{12}$$
Here, $b$ plays the role of the relaxed maximum, and each $\xi_t$'s deviation from zero indicates the
deviation from the (loosely enforced) maximum. We expect $\xi$ to be sparse.
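For a fixed vector of empirical risks, the inner minimization over $b$ in the relaxed objective, $\min_{b \ge 0} \{ b + \frac{1}{\alpha} \sum_t \max\{0, \hat{\ell}_t - b\} \}$, can be solved exactly, because the function is convex and piecewise linear in $b$ with breakpoints at the risks. The Python sketch below enumerates those breakpoints; the function name `alpha_minimax` and the example risk vector are illustrative assumptions.

```python
from typing import Sequence, Tuple


def alpha_minimax(risks: Sequence[float], alpha: float) -> Tuple[float, float]:
    """Return (value, b) minimizing b + (1/alpha) * sum_t max(0, risk_t - b) over b >= 0.

    The objective is convex and piecewise linear in b, so a minimizer lies at
    b = 0 or at one of the risks; every such breakpoint is tried.
    """
    if alpha <= 0 or any(r < 0 for r in risks):
        raise ValueError("need alpha > 0 and nonnegative risks")

    def objective(b: float) -> float:
        return b + sum(max(0.0, r - b) for r in risks) / alpha

    candidates = [0.0] + sorted(risks)
    best_b = min(candidates, key=objective)
    return objective(best_b), best_b


# Hypothetical risks: two easy tasks and one outlier task.
example_risks = [1.0, 2.0, 10.0]
```

On `example_risks`, a small value such as `alpha = 0.5` returns the hard maximum 10.0, while `alpha = 1.5` sets `b` to the second-largest risk 2.0 and charges the outlier only through its slack, and `alpha = 6` (larger than the number of tasks) drives `b` to zero so that the value is the sum of the risks divided by `alpha`.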

To help understand how $\alpha$ affects the learning problem, let us consider a few cases:

(1) When $\alpha > T$, the optimal value of $b$ is zero, and the problem is equivalent to classical MTL. To
see this, note that for a given candidate solution with $b > 0$ the objective always can be reduced
by reducing $b$ by some $\varepsilon$ and increasing each $\xi_t$ by the same $\varepsilon$.

(2) Suppose one task is much harder than all the other tasks (e.g. an outlier task), and its empirical
risk is separated from the maximum empirical risk of the other tasks by $\rho$. Let $1 < \alpha < 2$; now,
at the optimal hard maximum solution (where $\xi = 0$), the objective can be reduced by increasing
one of the $\xi_t$'s by $\rho$ and decreasing $b$ by $\rho$. Thus, the objective can focus on minimizing the
maximum risk of the set of $T - 1$ easier tasks. In this special setting, this argument can be
extended to the more general case $k < \alpha < k + 1$ and $k$ outlier tasks, for $k \in [T]$.

(3) As $\alpha$ approaches 0, we recover the hard maximum case of minimax MTL.

This work focuses on $\alpha$-minimax MTL with $\alpha = 2/(\lceil 0.1T + 0.5 \rceil^{-1} + \lceil 0.1T + 1.5 \rceil^{-1})$, i.e. the
harmonic mean of $\lceil 0.1T + 0.5 \rceil$ and $\lceil 0.1T + 1.5 \rceil$. The reason for this choice is that in the idealized
case (2) above, for large $T$ this setting of $\alpha$ makes the relaxed maximum consider all but the hardest
10% of the tasks. We also try the 20% level (i.e. $0.2T$ replacing $0.1T$ in the above).
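The choice of $\alpha$ is easy to compute exactly. The Python sketch below evaluates the harmonic mean of $\lceil pT + 0.5 \rceil$ and $\lceil pT + 1.5 \rceil$ using integer arithmetic, to avoid floating-point surprises in the ceilings. The function names and the convention of passing the level as an integer percentage are choices made for this sketch.

```python
from fractions import Fraction


def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for integers with b > 0."""
    return -(-a // b)


def alpha_for_level(T: int, percent: int = 10) -> Fraction:
    """Harmonic mean of ceil(p*T + 0.5) and ceil(p*T + 1.5), where p = percent / 100."""
    lo = ceil_div(percent * T + 50, 100)    # ceil(p*T + 0.5)
    hi = ceil_div(percent * T + 150, 100)   # ceil(p*T + 1.5)
    return Fraction(2 * lo * hi, lo + hi)
```

For $T = 55$ tasks at the 10% level the two ceilings are 6 and 7, so $\alpha = 84/13 \approx 6.46$, which lies strictly between 6 and 7 as required by case (2) with $k = 6$.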

Models. We now provide examples of how specific models fit into this framework. We consider
two convex multi-task learning formulations: Evgeniou and Pontil's regularized multi-task learning
(the EP model) and Argyriou, Evgeniou, and Pontil's convex multi-task feature learning (the
AEP model) [1]. The EP model is a linear model with a shared parameter $v_0 \in \mathbb{R}^d$ and task-specific
parameters $v_t \in \mathbb{R}^d$ (for $t \in [T]$). Evgeniou and Pontil presented this model as

$$\min_{v_0, \{v_t\}_{t \in [T]}} \sum_{t \in [T]} \sum_{i=1}^{m} \ell\big(y_{t,i}, \langle v_0 + v_t, x_{t,i} \rangle\big) + \lambda_0 \|v_0\|^2 + \frac{\lambda_1}{T} \sum_{t \in [T]} \|v_t\|^2, \tag{13}$$

for $\ell$ the hinge loss or squared loss. This can be set in the new paradigm via $\mathbb{H} = \{\mathcal{H}_{v_0} \mid v_0 \in \mathbb{R}^d\}$,
$\mathcal{H}_{v_0} = \{h : x \mapsto \langle v_0 + v_t, x \rangle \mid v_t \in \mathbb{R}^d\}$, and $\hat{\ell}_t(h_t) = \frac{1}{m} \sum_{i=1}^{m} \ell\big(y_{t,i}, \langle v_0 + v_t, x_{t,i} \rangle\big)$.

The AEP model minimizes the task-wise average loss with the trace norm (nuclear norm) penalty:

$$\min_{W} \sum_{t \in [T]} \sum_{i=1}^{m} \ell\big(y_{t,i}, \langle W_t, x_{t,i} \rangle\big) + \lambda \|W\|_{tr}, \tag{14}$$

where $\|\cdot\|_{tr} : W \mapsto \sum_i \sigma_i(W)$ is the trace norm. In the new paradigm, $\mathbb{H}$ is a set where each element
is a $k$-dimensional subspace of linear estimators (for $k \ll d$). Each $h_t = W_t$ in some $\mathcal{H} \in \mathbb{H}$ lives
in $\mathcal{H}$'s corresponding low-dimensional subspace. Also, $\hat{\ell}_t(h_t) = \frac{1}{m} \sum_{i=1}^{m} \ell\big(y_{t,i}, \langle h_t, x_{t,i} \rangle\big)$.

For easy empirical comparison between the various MTL formulations from the paradigm, at times
it will be convenient to use constrained formulations of the EP and AEP model. If the regularized
forms are used, a fair comparison of the methods warrants plotting results according to the size of
the optimal parameter found (i.e. $\|W\|_{tr}$ for AEP). For EP, the constrained form is:

$$\min_{v_0, \{v_t\}_{t \in [T]}} \sum_{t \in [T]} \sum_{i=1}^{m} \ell\big(y_{t,i}, \langle v_0 + v_t, x_{t,i} \rangle\big) \quad \text{subject to} \quad \|v_0\| \le \tau_0, \;\; \|v_t\| \le \tau_1 \text{ for } t \in [T].$$

For AEP, the constrained form is: $\min_{W} \sum_{t \in [T]} \sum_{i=1}^{m} \ell\big(y_{t,i}, \langle W_t, x_{t,i} \rangle\big)$ subject to $\|W\|_{tr} \le r$.
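To make the EP model concrete within the paradigm, the following Python sketch evaluates the vector of per-task empirical risks for given parameters and checks the norm constraints of the constrained form. It is only an evaluation helper under stated assumptions (squared loss, risks averaged over each task's examples, Euclidean norms), not a solver, and all names in it are illustrative.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Task = List[Tuple[Vector, float]]  # one task: list of (x, y) examples


def dot(a: Vector, b: Vector) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def ep_task_risks(v0: Vector, vs: Sequence[Vector], tasks: Sequence[Task]) -> List[float]:
    """Mean squared loss of the predictor x -> <v0 + v_t, x> on each task."""
    risks = []
    for v_t, task in zip(vs, tasks):
        w = [a + b for a, b in zip(v0, v_t)]  # shared plus task-specific parameter
        risks.append(sum((dot(w, x) - y) ** 2 for x, y in task) / len(task))
    return risks


def ep_is_feasible(v0: Vector, vs: Sequence[Vector], tau0: float, tau1: float) -> bool:
    """Check ||v0|| <= tau0 and ||v_t|| <= tau1 for every task."""
    def norm(v: Vector) -> float:
        return dot(v, v) ** 0.5
    return norm(v0) <= tau0 and all(norm(v) <= tau1 for v in vs)
```

The resulting risk vector can then be aggregated by its mean (classical MTL), its maximum (minimax MTL), or any other composition from the paradigm.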

4 Empirical evaluation 

We consider four learning problems; the first three involve regression (MTL model in parentheses): 

• A synthetic dataset composed from two modes of tasks (EP model), 

• The school dataset from the Inner London Education Authority (EP model), 

• The conjoint analysis personal computer ratings dataset² [11] (AEP model).

The fourth problem is multi-class classification from the MNIST digits dataset [10] with a reduction
to multi-task learning using a tournament of pairwise (binary) classifiers. We use the AEP model.
Given data, each problem involved a choice of MTL formulation (e.g. minimax MTL), model (EP or
AEP), and choice of regularized versus constrained. All the problems were solved using just a few
lines of code using CVX. In this work, we considered convex multi-task learning formulations
in order to make clear statements about the optimal solutions attained for various learning problems.

²This data, collected at the University of Michigan MBA program, generously was provided by Peter Lenk.
Figure 1: Max $\ell_2$-risk (Top two lines) and mean $\ell_2$-risk (Bottom two lines). At Left and Center: $\ell_2$-risk vs
noise level, for $\sigma_{task} = 0.1$ and $\sigma_{task} = 0.5$ respectively. At Right: $\ell_2$-risk vs task variation, for $\sigma_{noise} = 0.1$.
Dashed red is $\ell_1$, dashed blue is minimax. Error bars indicate one standard deviation. MTL results (not shown)
were similar to LTL results (shown), with MTL-LTL relative difference below 6.8% for all points plotted.



Two modes. The two modes regression problem consists of 50 linear prediction tasks for the first
type of task and 5 linear prediction tasks for the second task type. The true parameter for the first
task type is a vector $\mu$ drawn uniformly from the sphere of radius 5; the true parameter for the
second task type is $-2\mu$. Each task's parameter $w_t$ is drawn from an isotropic Gaussian with mean
taken from the task type and the standard deviation of all dimensions set to $\sigma_{task}$. Each data point
for each task is drawn from a product of 10 standard normals (so $x_{t,i} \in \mathbb{R}^{10}$). The targets are
generated according to $\langle w_t, x_{t,i} \rangle + \varepsilon_t$, where the $\varepsilon_t$'s are iid univariate centered normals with
standard deviation $\sigma_{noise}$. We fixed $\tau_0$ to a large value (in this case, $\tau_0 = 10$ is sufficient since the
mean for the largest task fits into a ball of radius 10) and $\tau_1$ to a small value ($\tau_1 = 2$). We compute
the average mean and maximum test error over 100 instances of the 55-task multi-task problem. Each
task's training set and test set are 5 and 15 points respectively. The average maximum (mean) test
error is the 100-experiment-average of the task-wise maximum (mean) of the $\ell_2$ risks. For each LTL
experiment, 55 new test tasks were drawn using the same $\mu$ as from the training tasks.
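The data-generating process described above can be sketched in a few lines of standard-library Python. The function names, the fixed seed, and the decision to draw a fresh noise term for every example are assumptions of this sketch; the sphere sample uses the standard trick of normalizing a Gaussian vector.

```python
import math
import random
from typing import List, Tuple

Example = Tuple[List[float], float]


def random_sphere_point(dim: int, radius: float, rng: random.Random) -> List[float]:
    """Uniform point on the sphere of the given radius (normalized Gaussian vector)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    return [radius * v / norm for v in g]


def make_two_modes(sigma_task: float, sigma_noise: float, n_points: int = 5,
                   dim: int = 10, seed: int = 0):
    """Return (mu, tasks); each task is (w_t, list of (x, y) examples)."""
    rng = random.Random(seed)
    mu = random_sphere_point(dim, 5.0, rng)
    # 50 tasks centered at mu, 5 tasks centered at -2 * mu.
    modes = [mu] * 50 + [[-2.0 * v for v in mu]] * 5
    tasks = []
    for mean in modes:
        w = [rng.gauss(m, sigma_task) for m in mean]  # task parameter around its mode
        data: List[Example] = []
        for _ in range(n_points):
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            y = sum(wi * xi for wi, xi in zip(w, x)) + rng.gauss(0.0, sigma_noise)
            data.append((x, y))
        tasks.append((w, data))
    return mu, tasks
```

Calling `make_two_modes(0.1, 0.1)` yields 55 training tasks with 5 points each; a matching test set of 15 points per task could be produced by calling the same routine with `n_points=15` and reusing each `w`.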

Figure 1 shows a tradeoff: when each task group is fairly homogeneous (left and center plots),
minimax is better at minimizing the maximum of the test risks while $\ell_1$ is better at minimizing the
mean of the test risks. As task homogeneity decreases (right plot), the gap in performance closes
with respect to the maximum of the test risks and remains roughly the same with respect to the mean.




Figure 2: Maximum RMSE (Left) and normalized mean RMSE (Right) versus task-specific parameter bound
$\tau_1$, for shared parameter bound $\tau_0$ fixed. In each figure, Left section is $\tau_0 = 0.2$ and Right section is $\tau_0 = 0.6$.
Solid red ♦ is $\ell_1$, solid blue • is minimax, dashed green ▲ is (0.1T)-minimax, dashed black ▼ is
(0.2T)-minimax. The results for $\ell_2$ MTL were visually identical to $\ell_1$ MTL and hence were not plotted.

School. The school dataset has appeared in many previous works [7, 2, 6]. For brevity we just say
the goal is to predict student test scores using certain student-level features. Each school is treated as
a separate task. We report both the task-wise maximum of the root mean square error (RMSE) and
the task-wise mean of the RMSE (normalized by number of points per task, as in previous works).

The results (see Figure 2) demonstrate that when the learner has moderate shared capacity $\tau_0$ and
high task-specific capacity $\tau_1$, minimax MTL outperforms $\ell_1$ MTL for the max objective; additionally,
for the max objective in almost all parameter settings (0.1T)-minimax and (0.2T)-minimax
MTL outperform $\ell_1$ MTL, and they also outperform minimax MTL when the task-specific capacity
$\tau_1$ is not too large. We hypothesize that minimax MTL performs the best in the high-$\tau_1$ regime because
stopping learning once the maximum of the empirical risks cannot be improved invokes early
stopping and its built-in regularization properties (see e.g. [13]). Interestingly, for the normalized
mean RMSE objective, both minimax relaxations are competitive with $\ell_1$ MTL; however, when the
shared capacity $\tau_0$ is high (right section, right plot), $\ell_1$ MTL performs the best. For high task-specific
capacity $\tau_1$, minimax MTL and its relaxations again seem to resist overfitting compared to $\ell_1$ MTL.



Personal computer. The personal computer dataset is composed of 189 human subjects, each of whom
rated on a 0-10 scale the same 20 computers (16 training, 4 test). Each computer has 13 binary
features (amount of memory, screen size, price, etc.).

The results are shown in Figure 3. In the MTL setting, for both the maximum RMSE objective and the
mean RMSE objective, $\ell_1$ MTL appears to perform the best. When the trace norm of $W$ is high, minimax
MTL displays resistance to overfitting and obtains the lowest mean RMSE. In the LTL setting for the
maximum RMSE objective, $\ell_2$, minimax, and (0.1T)-minimax MTL all outperform $\ell_1$ MTL. For the mean
RMSE, $\ell_1$ MTL obtains the lowest risk for almost all parameter settings.
MNIST. The MNIST task is a 10-class problem; we approach it via a reduction to a tournament of 45
binary classifiers trained via the AEP model. The dimensionality was reduced to 50 using principal
component analysis (computed on the full training set), and only the first 2% of each class's
training points was used for training.

Intuitively, the performance of the tournament tree of binary classifiers can only be as accurate as
its paths, and the accuracy of each path depends on the accuracy of the nodes. Hence, our hypothesis
is that minimax MTL should outperform $\ell_1$ MTL. The results in Figure 4 confirm our hypothesis.
Minimax MTL outperforms $\ell_1$ MTL when the capacity $\|W\|_{tr}$ is somewhat limited, with the gap
widening as the capacity decreases. Furthermore, at every capacity minimax MTL is competitive with
$\ell_1$ MTL.
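The reduction from 10 classes to 45 binary tasks, and one way of combining the binary classifiers at test time, can be sketched as follows. The paper does not spell out the exact tournament structure, so the sequential single-elimination pass below is an assumption of this sketch, as are the function names and the convention that each pairwise classifier returns the winning class label.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Pair = Tuple[int, int]


def all_pairs(n_classes: int) -> List[Pair]:
    """All unordered class pairs (a, b) with a < b; 45 pairs for 10 classes."""
    return list(combinations(range(n_classes), 2))


def tournament_predict(x: object,
                       classifiers: Dict[Pair, Callable[[object], int]],
                       n_classes: int) -> int:
    """Single-elimination pass: the current champion meets each later class in turn."""
    champion = 0
    for challenger in range(1, n_classes):
        # The champion always has a smaller index than the challenger, so the key is ordered.
        champion = classifiers[(champion, challenger)](x)
    return champion
```

Each prediction follows one path of pairwise matches, so a single unreliable binary classifier on that path can spoil the final answer, which is the intuition behind controlling the worst task rather than the average.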

Figure 3: MTL (Top) and LTL (Bottom). Maximum $\ell_2$ risk (Left) and mean $\ell_2$ risk (Right) vs bound on
$\|W\|_{tr}$. LTL used 10-fold cross-validation (10% of tasks left out in each fold). Solid red ♦ is
$\ell_1$, solid blue • is minimax, dashed green ▲ is (0.1T)-minimax, dashed black ▼ is (0.2T)-minimax,
solid gold is $\ell_2$.

[Figure 4 plot: test multiclass 0-1 loss (vertical axis) versus trace norm of W (horizontal axis).]

Figure 4: Test multiclass 0-1 loss vs $\|W\|_{tr}$. Solid red is $\ell_1$ MTL, solid blue is minimax, dashed
green is (0.1T)-minimax, dashed black is (0.2T)-minimax. Regularized AEP used for speed and trace norm
of W's computed, so samples differ per curve.

5 Discussion


We have established a continuum of formulations for MTL which recovers as special cases classical
MTL and the newly formulated minimax MTL. In between these extreme points lies a continuum of
relaxed minimax MTL formulations. More generally, we introduced a loss-compositional paradigm
that operates on the vector of empirical risks, inducing the additional $\ell_p$ MTL paradigms. The
empirical evaluations indicate that $\alpha$-minimax MTL at either the 10% or 20% level often outperforms
$\ell_1$ MTL in terms of the maximum test risk objective and sometimes even in the mean test risk
objective. All the minimax or $\alpha$-minimax MTL formulations exhibit a built-in safeguard against
overfitting in the case of learning with a model that is very complex relative to the available data.

Although efficient algorithms may make the various new MTL learning formulations practical for
large problems, a proper effort to develop fast algorithms in this setting would have detracted from
the main point of this first study. A good direction for the future is to obtain efficient algorithms
for minimax and $\alpha$-minimax MTL. In fact, such algorithms might have applications beyond MTL
and even machine learning. Another area ripe for exploration is to establish more general learning
bounds for minimax MTL and to extend these bounds to $\alpha$-minimax MTL.